[Model] Add Ming-omni-tts dense 0.5B pipeline #2906
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
I guess everyone is suffering under the new limits (╥_╥) |
hsliuustc0106
left a comment
This PR is marked as [WIP] and is substantial (~10,500 lines / 47 files).
Could you please run the L3 tests locally and paste the results here? This helps validate the integration on your end before we proceed with full review.
hsliuustc0106
left a comment
Please make changes accordingly after #2383 is merged. For the model usage, I suggest writing a model recipe under vllm_omni/recipes using the template. There also seems to be some duplicate/dead code; can you try to compress it first?
|
I also recommend you to use the add-tts-models skill |
linyueqian
left a comment
Thanks for the thorough test matrix and the warm-cache RTF numbers, those are the right kind of evidence for a model-add PR. At 10.5k additions the PR is hard to review carefully. I think it can stay as one PR if we condense it by reusing modules that already live in the repo. Inline comments below on the specific files, ordered roughly by expected line savings.
Not blocking merge, flagging for the author and maintainers.
| @@ -0,0 +1,207 @@ | |||
| # SPDX-License-Identifier: Apache-2.0 | |||
[MAJOR] cosyvoice3/code2wav_core/cfm.py (325 lines) already implements Conditional Flow Matching. This PR adds fm/cfm.py (207) plus fm/modules.py (147), for roughly 350 lines of duplicated logic.
Suggestion: promote the cosyvoice3 CFM plus a DiT base to vllm_omni/model_executor/modules/flow_matching/, have Ming import it, and keep only fm/dit.py (Ming-specific conditioning) and fm/flowloss.py here.
This is a cross-model refactor, fine to land as a prerequisite PR owned by a maintainer or cc @yuanheng-zhao rather than blocking Ming on it. Worth an issue link from the PR body at minimum.
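To make the scope of such a shared module concrete, here is a minimal sketch of the kind of model-agnostic piece that could sit under the suggested `flow_matching/` package: a fixed-step Euler integrator that accepts any velocity estimator as a callable. The names `euler_sample`, `VelocityFn`, and `toy_velocity` are hypothetical and are not taken from cosyvoice3 or this PR; real solvers work on tensors and add conditioning and CFG on top.

```python
from typing import Callable, List

# Hypothetical signature: (x_t, t) -> velocity, one float per latent dimension.
VelocityFn = Callable[[List[float], float], List[float]]


def euler_sample(velocity: VelocityFn, x0: List[float], num_steps: int = 10) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        # One explicit Euler update per latent dimension.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: List[float], t: float) -> List[float]:
    # Stand-in for a model-specific estimator (e.g. a DiT): a constant field
    # that moves every coordinate by +2 over the unit time interval.
    return [2.0 for _ in x]


result = euler_sample(toy_velocity, [0.0, 1.0], num_steps=4)
```

With this split, each model would only supply its own velocity estimator and conditioning, while the integration loop is shared.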
Sure, I’ll file/link a follow-up issue for promoting the shared CFM/DiT base into vllm_omni/model_executor/modules/flow_matching/, unless you prefer this to be a prerequisite PR before Ming lands.
| @@ -0,0 +1,188 @@ | |||
| # SPDX-License-Identifier: Apache-2.0 | |||
[MINOR] Pure math. qwen3_tts/tokenizer_25hz/ and voxtral_tts/voxtral_tts_audio_tokenizer.py also ship an iSTFT. Recommend opening a follow-up issue to migrate all three to a shared vllm_omni/model_executor/modules/audio/stft.py. Not a blocker on this PR, but please file the issue so this does not go cold.
Sure! I’ll file/link a follow-up issue to migrate the repeated iSTFT implementations into a shared vllm_omni/model_executor/modules/audio/stft.py.
|
@akshatvishu It seems there are a lot of added files that could re-use modules from the talker of I'll update #2890 later today and try to merge it ASAP, and then you might want to rebase |
|
@yuanheng-zhao Sure, I will wait for #2890 to get merged and will then start working on the suggestions left by @linyueqian, as it seems like I can borrow a lot from |
|
Hey @akshatvishu , the `git rebase --onto main the-talker-branch your-current-branch` Note: fetch first so that you have the latest main and my branch locally |
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
d949ec7 to
9add4ef
…s signature Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…tecture Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
…to-detection fails Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Signed-off-by: akshatvishu <akshatnayak197@gmail.com> # Conflicts: # vllm_omni/engine/async_omni_engine.py # vllm_omni/entrypoints/openai/serving_speech.py
Offline: promote AsyncOmni to a module-scoped fixture so the streaming
test shares the engine init with the other three tests instead of paying
a fresh two-stage load each run (~30 min → ~15 min on L4). Also cleans
up the inline try/finally that the fixture teardown now handles.
Online: replace four-level Path(__file__).parent chain with
get_deploy_config_path("ming_tts.yaml"), matching the convention used
by cosyvoice3 and moss_tts_nano. Drops the now-unused pathlib import.
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
|
Thanks for the incredibly thorough review @linyueqian ! I've just pushed a batch of commits that resolves all the [HIGH], [MEDIUM] and [LOW] architectural feedback. Regarding the +7,300 LOC and deduplication: Since extracting the common components into ming_utils/ means touching the existing ming_flash_omni architecture, I want to be careful not to break its CFMGraphExecutor. I'm working on this refactor now. Give me a day or two to move the shared logic, test it locally against both models and trim down the docs. I'll ping you for a re-review once it's pushed! |
|
Hi @linyueqian , I split out the shared Ming modules as suggested, but while validating the change I hit existing Ming Flash Omni failures on ROCm.

Tested with the official ROCm Docker image on an MI300X x8 node, provided through the AMD Developer Cloud program. Thanks to the AMD developer program team for granting access to the node.

All expansion tests initially failed (compatibility issues in the existing Ming Flash Omni path against the current vLLM/vLLM-Omni +

ENV: `python collect_env.py`

```bash
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version : 22.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.2.2 26084 f58b06dce1f9c15707c5f808fd002e18c2accf7e)
CMake version : version 3.31.10
Libc version : glibc-2.35
==============================
```
|
…dense # Conflicts: # tests/model_executor/stage_input_processors/test_qwen3_omni_streaming_helpers.py # vllm_omni/model_executor/models/ming_flash_omni/ming_flash_omni_thinker.py # vllm_omni/model_executor/stage_input_processors/ming_flash_omni.py
|
Hi @yuanheng-zhao ! In case the pytest failures also reproduce on CUDA with the latest `git cherry-pick 149b0c09 55c1b124` They seem to resolve the issue on the ROCm side. Please let me know if you’d prefer a different approach. I’m happy to make the changes. |
Sure, I think I could get back within 12 hrs. Btw, may I have the versions of vllm, vllm-omni, and transformers for your env for reference? Thanks. |
transformers==5.8.1, vLLM Version : 0.22.0, vLLM-Omni Version : 0.1.dev1818+gbc49be130.rocm (git sha: bc49be1, date: ocm) Also, the output of I also tested this yesterday with vllm== Also, the sampling_metadata parameter became outdated in vllm after this change: vllm-project/vllm@1c3ffdb P.S. I also pinged you on the vLLM Slack in case it’s easier to follow up there. |
|
Hey @akshatvishu , I did reproduce errors with both |
…dense Signed-off-by: akshatvishu <akshatnayak197@gmail.com> # Conflicts: # vllm_omni/entrypoints/openai/serving_speech.py # vllm_omni/worker/gpu_ar_model_runner.py
Adapt the Ming Flash Omni talker compatibility fixes suggested in PR vllm-project#4080. Suggested-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
| if deploy_config_path is not None: | ||
| _deploy_path = Path(deploy_config_path) | ||
| if _deploy_path.exists(): | ||
| _deploy_cfg = load_deploy_config(_deploy_path) | ||
| if _deploy_cfg.pipeline and _deploy_cfg.pipeline in _PIPELINE_REGISTRY: | ||
| return cls._create_from_registry(_deploy_cfg.pipeline, cli_overrides, deploy_config_path) | ||
|
|
PTAL @alex-jw-brooks at the changes related to stage configs. Do we currently have ongoing logic to handle this?
Suggested-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
Why are the Ming model utils defined as a separate model executor? Could there be a better place for them? We should keep this aligned with the repository structure.
Will this be more ideal?
From: vllm_omni/model_executor/models/ming_utils/
To: vllm_omni/model_executor/models/common/ming/
> Will this be more ideal? From: vllm_omni/model_executor/models/ming_utils/ To: vllm_omni/model_executor/models/common/ming/
yea, probably!
Thanks for the suggestion @Nightwing-77 ! Just pushed the commit for the same.
Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
3b8d741 to
34d13eb
| description="Language code (e.g., 'Chinese', 'English', 'Auto')", | ||
| ) | ||
| ref_audio: str | None = Field( | ||
| ref_audio: str | list[str] | None = Field( |
does ming TTS support multiple ref audio!?
Yeah, it supports things like podcast/multi-speaker TTS scenarios! You can check https://github.com/inclusionAI/Ming-omni-tts/blob/94a4d409/cookbooks/cookbook.ipynb#L192-L196 for more info!
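The widened `ref_audio: str | list[str] | None` field means downstream code has to handle three shapes. A minimal sketch of one way to normalize them, assuming a hypothetical helper named `normalize_ref_audio` that is not part of this PR:

```python
from __future__ import annotations


def normalize_ref_audio(ref_audio: str | list[str] | None) -> list[str]:
    """Return reference audio inputs as a list, one entry per speaker."""
    if ref_audio is None:
        # No voice cloning requested.
        return []
    if isinstance(ref_audio, str):
        # Single-speaker case: wrap the lone reference.
        return [ref_audio]
    # Multi-speaker case (e.g. podcast): reject non-string entries early.
    bad = [item for item in ref_audio if not isinstance(item, str)]
    if bad:
        raise TypeError(f"ref_audio entries must be strings, got: {bad!r}")
    return list(ref_audio)


single = normalize_ref_audio("speaker_a.wav")
multi = normalize_ref_audio(["speaker_a.wav", "speaker_b.wav"])
```

Normalizing once at the API boundary would let the prompt builder treat single-speaker and multi-speaker requests with the same loop.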
Purpose
Add Ming-omni-tts dense 0.5B support to vLLM-Omni via a two-stage AR+Flow → Audio VAE pipeline.
Original repo: https://github.com/inclusionAI/Ming-omni-tts
Resolves: #1461
Changes:
Model files (`vllm_omni/model_executor/models/ming_tts/`)
- `ming_tts.py`: top-level two-stage dispatcher and weight-loading entry point
- `ming_tts_llm.py`: Stage-0 Qwen2 AR backbone with inline Aggregator, FlowLoss, stop head, and latent patch emission
- `ming_tts_audio_vae.py`: Stage-1 Audio VAE decoder producing 44.1 kHz mono waveform output
- `config_ming_tts.py`: Ming dense constants, runtime keys, latent sizes, token IDs, stop-head defaults, and sample-rate validation
- `configuration_ming_dense.py`: Hugging Face config adapter for `inclusionAI/Ming-omni-tts-0.5B`
- `prompt_builder.py`: prompt construction for speech, music, instructions, TTA, prompt waveform, and speaker embeddings
- `ingress.py`: first-stage prompt ingestion for the disaggregated pipeline
- `speaker_extractor.py`: CampPlus 192-d speaker embedding extraction for reference audio
- `fm/`: Flow Matching modules used by Stage-0 latent generation
- `audio_tokenizer/`: Ming Audio VAE tokenizer and decoder support modules

Registry
- `MingTTSForConditionalGeneration`, `MingLLMModel`, and `MingAudioVAEModel` in `vllm_omni/model_executor/models/registry.py`

Stage config & input processors
- `vllm_omni/model_executor/stage_configs/ming_tts.yaml`: sequential two-stage AR+Flow → Audio VAE pipeline
- `vllm_omni/model_executor/stage_configs/ming_tts_async_chunk.yaml`: async chunk pipeline with SharedMemoryConnector, `latent_chunk_size: 25`, and `max_num_seqs: 1`
- `vllm_omni/model_executor/stage_input_processors/ming_tts.py`: Stage-0 → Stage-1 latent patch transfer for `llm2audio_vae` and `llm2audio_vae_async_chunk`, including final partial chunk flush

Offline examples
- `examples/offline_inference/ming_tts/end2end.py`: end-to-end Omni example covering 11 cookbook cases: `style`, `ip`, `bgm`, `tta`, `emotion`, `basic`, `dialect`, `zero_shot`, `podcast`, `speech_bgm`, `speech_sound`
- `examples/offline_inference/ming_tts/README.md`: offline launch notes for sequential and async chunk runs

Online serving
- `vllm_omni/entrypoints/openai/serving_speech.py`: Ming prompt builder for OpenAI-compatible `/v1/audio/speech`, with structured instructions, `voice` → IP, `language` → dialect, reference audio, 192-d speaker embeddings, podcast multi-speaker conditioning, and streaming PCM/WAV output
- `examples/online_serving/ming_tts/run_server.sh`: async chunk server launch script
- `examples/online_serving/ming_tts/openai_speech_client.py`: API client covering Ming controls and streaming output
- `examples/online_serving/ming_tts/run_curl.sh`: curl examples for `/v1/audio/speech`
- `examples/online_serving/ming_tts/README.md` and `docs/user_guide/examples/online_serving/ming_tts.md`: online serving documentation

Architecture:
Known limitations / follow-ups:
- `/v1/audio/speech` does not yet expose `prompt_mode=music/tta` or FlowLoss controls (`cfg`, `sigma`, `temperature`); online BGM and TTA require a future prompt-mode API extension.
- `max_num_seqs: 1`; multi-request batching is not yet validated.
- `latent_chunk_size: 5` improves online TTFP significantly but diverges on `podcast` in the offline async matrix; repo YAML stays on the validated `latent_chunk_size: 25` default until that is resolved.

Test Plan
Validation was performed on an NVIDIA L4 GPU (Colab).
Offline sequential (full 11-case cookbook matrix):
Offline async_chunk (full 11-case cookbook matrix):

```bash
python examples/offline_inference/ming_tts/end2end.py \
  --case <case> \
  --streaming \
  --stage-configs-path vllm_omni/model_executor/stage_configs/ming_tts_async_chunk.yaml \
  --enforce-eager
```

Online serving (`/v1/audio/speech` async_chunk checks):

Test Result
Offline correctness, sequential vs. async_chunk (`latent_chunk_size: 25`):
All 11 cases produced identical frame counts and Stage-1 total patch counts between sequential and default async_chunk, confirming correct Stage-0 → Stage-1 handoff and final partial chunk flush.
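Equal patch counts are what one would expect when the chunker emits every full chunk as soon as it is available and flushes whatever remains once Stage-0 finishes. A minimal sketch of that behaviour, assuming a hypothetical `chunk_patches` generator rather than the PR's actual stage input processor:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_patches(patches: Iterable[T], chunk_size: int = 25) -> Iterator[List[T]]:
    """Yield full chunks of latent patches, then flush the final partial chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[T] = []
    for patch in patches:
        buffer.append(patch)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    # Without this flush, trailing patches would never reach Stage-1.
    if buffer:
        yield buffer


# 60 patches with the default chunk size: two full chunks plus a partial one.
chunks = list(chunk_patches(range(60), chunk_size=25))
chunk_lengths = [len(chunk) for chunk in chunks]
total_patches = sum(chunk_lengths)
```

Here `chunk_lengths` is `[25, 25, 10]` and `total_patches` equals the 60 patches fed in, which is the invariant the sequential vs. async_chunk comparison checks.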
Cases: style, ip, bgm, tta, emotion, basic, dialect, zero_shot, podcast, speech_bgm, speech_sound

Upstream FlashAttention comparison (cold, single-request, L4):
Upstream: torch 2.6.0+cu124, FlashAttention 2.7.4.post1. vLLM-Omni VAE stage runs through SDPA, not upstream FlashAttention. Integration comparison, not kernel parity benchmark.
Cases: style, ip, bgm, emotion, basic, dialect, zero_shot, podcast, speech_bgm, speech_sound
vLLM-Omni matches or beats upstream RTF on `bgm`; async25 is near-parity on `style`, `zero_shot`, and `podcast`. Cold single-request numbers include engine startup and first-request lazy setup costs.

Warm-cache RTF vs upstream (L4, post-warmup, 1 warmup + 1 measured request):
Warm-cache removes first-request lazy setup. Fairer per-request comparison against upstream.
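For reference, RTF (real-time factor) is wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 are faster than real time. A minimal sketch of the "1 warmup + 1 measured request" procedure, assuming a hypothetical `synthesize` callable that returns the number of generated samples (a stand-in, not a vLLM-Omni API):

```python
import time
from typing import Callable


def measure_warm_rtf(
    synthesize: Callable[[], int],
    sample_rate: int = 44_100,
    warmup_runs: int = 1,
) -> float:
    """Run warmup requests, then time one request and return its RTF."""
    for _ in range(warmup_runs):
        # Warmup absorbs first-request lazy setup so it is not measured.
        synthesize()
    start = time.perf_counter()
    num_samples = synthesize()
    elapsed = time.perf_counter() - start
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("synthesize() produced no audio")
    return elapsed / audio_seconds


def fake_synthesize() -> int:
    # Stand-in workload: about 10 ms of "compute" for one second of audio.
    time.sleep(0.01)
    return 44_100


rtf = measure_warm_rtf(fake_synthesize)
```

A single measured request is noisy, so numbers collected this way are best read as indicative rather than as a rigorous benchmark.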
Cases: style, ip, bgm, emotion, basic, dialect, zero_shot, podcast, speech_bgm, speech_sound
Warm vLLM-Omni sequential beats upstream FlashAttention RTF across all 10 measured cases. Async25 further reduces RTF for longer/reference-conditioned cases and the zero-ref `style`/`bgm` runs.

Warm-cache offline benchmark (L4, 1 warmup + 1 measured request):
Cases: style, ip, bgm, zero_shot, podcast, tta, basic
Async chunk benefits longer/reference-conditioned cases; overhead roughly cancels the overlap benefit for short speech cases.

Online serving benchmark (10 prompts, concurrency 1, eager, L4):
Configurations: sequential_eager, async_chunk_eager (chunk=25), async_chunk_bench (chunk=5)
`latent_chunk_size: 5` reduces mean TTFP by ~73% and E2E by ~11% vs. sequential, but remains experimental pending podcast offline finalization.

Online `/v1/audio/speech` validation (async_chunk, all speech-mode cases):
All cases returned valid WAV at 44.1 kHz. Streaming PCM returned progressive chunks. Reference audio, speaker embedding, and podcast multi-reference checks passed.
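A check like "valid WAV at 44.1 kHz" can be done with the standard-library `wave` module. A minimal sketch, assuming the response body is available as bytes; the helper names `check_wav` and `make_silent_wav` are hypothetical and this is not the PR's test code:

```python
import io
import wave


def check_wav(data: bytes, expected_rate: int = 44_100) -> float:
    """Validate a PCM WAV payload and return its duration in seconds."""
    # wave.open raises wave.Error if the payload is not a readable WAV file.
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    if rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
    if frames == 0:
        raise ValueError("WAV contains no audio frames")
    return frames / rate


def make_silent_wav(seconds: float = 0.5, rate: int = 44_100) -> bytes:
    """Build a mono 16-bit silent WAV in memory, standing in for a server response."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


duration = check_wav(make_silent_wav())
```

The same helper could be pointed at the bytes returned by a non-streaming `/v1/audio/speech` request; raw streaming PCM has no header and would need a separate check.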
Cases: style, ip, basic, emotion, dialect, zero_shot, podcast, speech_bgm, speech_sound, streaming